BioJulia Series: BED

Original article: click here

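Out of date: this package appears to be somewhat dated, but because its contents and methods are fairly simple, it can probably still be used without problems.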

For an introduction to the BED format, see here.

BED.jl provides I/O operations for the BED format and supports tabix indexes.

The BED.Record structure

```julia
# copied from https://github.com/BioJulia/BED.jl/blob/2e082b1a9f8c6c543e5544a90a7f970110ca7e6b/src/record.jl#L4
mutable struct Record
    # data and filled range
    data::Vector{UInt8}
    filled::UnitRange{Int}
    # number of columns
    ncols::Int
    # indexes
    chrom::UnitRange{Int}
    chromstart::UnitRange{Int}
    chromend::UnitRange{Int}
    name::UnitRange{Int}
    score::UnitRange{Int}
    strand::Int
    thickstart::UnitRange{Int}
    thickend::UnitRange{Int}
    itemrgb::UnitRange{Int}
    blockcount::UnitRange{Int}
    blocksizes::Vector{UnitRange{Int}}
    blockstarts::Vector{UnitRange{Int}}
end
```
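Most fields in the struct are byte ranges into `data` rather than parsed values, which suggests that fields are decoded only when an accessor asks for them. The sketch below illustrates that idea in plain Julia. `MiniRecord`, `mini_chrom`, and `mini_chromend` are hypothetical names made up for this illustration; this is not BED.jl's actual parser.

```julia
# Hypothetical miniature of the "raw bytes + index ranges" layout.
# This is an illustration only, not BED.jl's implementation.
struct MiniRecord
    data::Vector{UInt8}          # raw bytes of one BED line
    chrom::UnitRange{Int}        # byte range of column 1
    chromstart::UnitRange{Int}   # byte range of column 2
    chromend::UnitRange{Int}     # byte range of column 3
end

# Locate the tab-separated columns and remember where the first three live.
function MiniRecord(line::AbstractString)
    data = Vector{UInt8}(codeunits(line))
    ranges = UnitRange{Int}[]
    start = 1
    for (i, byte) in enumerate(data)
        if byte == UInt8('\t')
            push!(ranges, start:i-1)
            start = i + 1
        end
    end
    push!(ranges, start:length(data))
    length(ranges) >= 3 || throw(ArgumentError("a BED line needs at least 3 columns"))
    return MiniRecord(data, ranges[1], ranges[2], ranges[3])
end

# Accessors decode a field only when it is requested.
mini_chrom(r::MiniRecord) = String(r.data[r.chrom])
mini_chromend(r::MiniRecord) = parse(Int, String(r.data[r.chromend]))

rec = MiniRecord("chr1\t100\t200\tfeature1")
println(mini_chrom(rec))     # chromosome name as a String
println(mini_chromend(rec))  # end coordinate as an Int
```

Storing ranges instead of strings means a record can be refilled from a new line without allocating one object per column, which is what makes the in-place reading pattern later in this post worthwhile.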

Reading and writing

The most basic way to read a file:

```julia
using BED

# Open the input file
reader = open(BED.Reader, "file.bed")

# Iterate over the records
for rcd in reader
    # Do something with each record ...
    chrom = BED.chrom(rcd)
    # ...
end
close(reader)
```

This approach allocates a new record for every line it reads, which costs memory. To save memory, you can instead update a single record in place:

```julia
reader = open(BED.Reader, "file.bed")
record = BED.Record()
while !eof(reader)
    empty!(record)
    read!(reader, record)
    # Do something with the record ...
end
close(reader)
```
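As a small end-to-end sketch of the in-place pattern above, the following counts records per chromosome. It assumes BED.jl is installed and still loads on your Julia version. The temporary file, its contents, and the helper name `count_by_chrom` are made up for illustration.

```julia
using BED  # assumption: the BED.jl package is installed

# Write a tiny three-column BED file to a temporary path (made-up data).
path = tempname() * ".bed"
write(path, "chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t150\n")

# Hypothetical helper: count records per chromosome with one reused Record.
function count_by_chrom(path::AbstractString)
    counts = Dict{String,Int}()
    reader = open(BED.Reader, path)
    record = BED.Record()
    try
        while !eof(reader)
            empty!(record)
            read!(reader, record)
            # Copy out what we need before the next read! overwrites the record.
            chrom = BED.chrom(record)
            counts[chrom] = get(counts, chrom, 0) + 1
        end
    finally
        close(reader)  # close the reader even if parsing fails
    end
    return counts
end

counts = count_by_chrom(path)
println(counts)
```

Because the same `record` is overwritten on every iteration, anything you want to keep (here the chromosome name) has to be extracted inside the loop; holding on to `record` itself would only preserve the last line read.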

If you need to query the records in particular intervals repeatedly, building an IntervalCollection is more efficient (see GenomicFeatures.jl):

```julia
using BED
using GenomicFeatures

# Create an interval collection in memory.
icol = open(BED.Reader, "data.bed") do reader
    IntervalCollection(reader)
end

# Query overlapping records.
for interval in eachoverlap(icol, Interval("chrX", 40001, 51500))
    # A record is stored in the metadata field of an interval.
    record = metadata(interval)
    # ...
end
```
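To make explicit what an overlap query returns, here is a dependency-free sketch of the overlap test itself, written as a linear scan. It assumes 1-based, inclusive coordinates for the query, which is my understanding of GenomicFeatures.jl intervals. `SimpleInterval`, `overlaps`, and `linear_overlaps` are hypothetical names, and the scan is a deliberately naive stand-in, not how `IntervalCollection` is implemented.

```julia
# Hypothetical stand-in for an interval; GenomicFeatures.jl has its own type.
struct SimpleInterval
    seqname::String
    first::Int   # 1-based, inclusive (assumed convention)
    last::Int    # 1-based, inclusive (assumed convention)
end

# Two closed intervals overlap when they are on the same sequence
# and each one starts no later than the other one ends.
overlaps(a::SimpleInterval, b::SimpleInterval) =
    a.seqname == b.seqname && a.first <= b.last && b.first <= a.last

# Naive O(n) query: test every stored interval against the query.
linear_overlaps(intervals, query) = filter(iv -> overlaps(iv, query), intervals)

# Made-up example data.
intervals = [
    SimpleInterval("chrX", 39000, 40000),  # ends just before the query
    SimpleInterval("chrX", 40000, 40001),  # touches the first query base
    SimpleInterval("chrX", 51500, 60000),  # touches the last query base
    SimpleInterval("chr1", 40001, 51500),  # same coordinates, other chromosome
]

query = SimpleInterval("chrX", 40001, 51500)
hits = linear_overlaps(intervals, query)
println(length(hits))
```

A linear scan like this has to visit every interval on every query. The point of building a collection once is that repeated queries can avoid that full scan, at the cost of holding all records in memory.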